Abstract

Background: An important step in the reconstruction of a metabolic network is annotation of metabolites. 
Metabolites are generally annotated with various database or structure based identifiers. Metabolite annotations in 
metabolic reconstructions may be incorrect or incomplete and thus need to be updated prior to their use. 
Genome-scale metabolic reconstructions generally include hundreds of metabolites. Manually updating annotations 
is therefore highly laborious. This prompted us to look for open-source software applications that could facilitate 
automatic updating of annotations by mapping between available metabolite identifiers. We identified three 
applications developed for the metabolomics and chemical informatics communities as potential solutions. The 
applications were MetMask, the Chemical Translation System, and UniChem. The first implements a "metabolite
masking" strategy for mapping between identifiers, whereas the latter two implement different versions of an
InChI-based strategy. Here we evaluated the suitability of these applications for the task of mapping between metabolite
identifiers in genome-scale metabolic reconstructions. We applied the best suited application to updating identifiers 
in Recon 2, the latest reconstruction of human metabolism. 

Results: All three applications enabled partially automatic updating of metabolite identifiers, but significant manual 
effort was still required to fully update identifiers. We were able to reduce this manual effort by searching for new 
identifiers using multiple types of information about metabolites. When multiple types of information were 
combined, the Chemical Translation System enabled us to update over 3,500 metabolite identifiers in Recon 2. All but 
approximately 200 identifiers were updated automatically. 

Conclusions: We found that an InChI-based application such as the Chemical Translation System was better suited
to the task of mapping between metabolite identifiers in genome-scale metabolic reconstructions. We identified 
several features, however, that could be added to such an application in order to tailor it to this task. 

Keywords: Metabolic network reconstruction, Metabolite identifiers, Automation, MetMask, The Chemical 
Translation System, UniChem, Recon 2 
Background

Metabolic network reconstructions are knowledge bases 
of key components in the metabolic reaction network of 
a particular organism or cell type [1,2]. Reconstructions 
vary in size from small-scale reconstructions where only 
a few pathways of interest are included, to genome-scale 
reconstructions where the aim is to include all known 
components in a network. The components of a metabolic 
network reconstruction are the metabolic reactions in the 
network, the metabolites that participate in those reac- 
tions, the enzymes that catalyze the reactions, and the 
genes that encode those enzymes. Individual components 
are linked with mathematical structures that enable com- 
putational analysis of the network as an integrated system. 
Computational analysis of genome-scale metabolic recon- 
structions has, for example, been used for biochemical 
[3,4], biomedical [5-7], and bioengineering [8,9] purposes. 

An important step in the reconstruction of a metabolic 
network is the annotation of reconstruction compo- 
nents [2]. The identities and functions of components 
included in metabolic network reconstructions are gen- 
erally known. Annotation here refers to the process of 
attaching various metadata for components to a recon- 
struction. These annotations serve to unambiguously 
identify components and enable efficient mapping of 
data to the reconstruction for analysis. Components are 
generally annotated with an appropriate type of identi- 
fier, e.g., an Entrez gene ID for genes and an Enzyme 
Classification (EC) number for enzymes. Metabolites are 
usually annotated with several types of identifiers. A com- 
prehensive protocol for genome-scale metabolic recon- 
struction [2] recommended annotating metabolites with 
a primary identifier in at least one of the following 
three databases: ChEBI [10], KEGG Compound [11,12], 
or PubChem Compound [13]. The protocol also rec- 
ommended annotating metabolites with structure-based 
identifiers, such as the IUPAC International Chemical 
Identifier (InChI) and the Simplified Molecular-Input 
Line-Entry System (SMILES). We add that further anno- 
tation with primary identifiers in databases that are spe- 
cific to the reconstructed organism is also advisable, e.g., 
the Human Metabolome Database (HMDB) [14,15] for 
human metabolic reconstructions. 

Database identifiers have the advantage of providing 
direct links to data that are stored in each database. 
Data types that can be mapped to the reconstruction 
via metabolite identifiers include physicochemical data, 
metabolomics data, metabolic pathways and metabo- 
lite structures. Different types of data are available 
in each database. Whereas KEGG has more data on 
metabolic genes, enzymes and reactions, HMDB provides 
more information on metabolites. Since not all chemical 
databases provide cross-references to all other databases, 
it is usually not enough to annotate metabolites with only
one type of identifier. Instead they should be annotated
with identifiers in as many relevant databases as possible. 
Multiple annotations also aid in identification since any 
one database may not contain all metabolites in a given 
reconstruction. 

Advantages of structure-based identifiers are that they
are unambiguous and database independent. InChIs and
SMILES strings can also be converted to metabolite structures
that can be used directly for various computational
analyses [16-19]. Although SMILES strings have a simpler
syntax and are more human readable, InChIs are
preferable in many ways [20]. Firstly, they have a layered
structure that makes them highly flexible and easy to
manipulate. Secondly, they can account for tautomerism.
Multiple tautomers of the same compound can be represented
with the same standardized or "standard" InChI.
Alternatively, a specific tautomer can be represented with
a nonstandard InChI. A third advantage of InChIs is that
the InChI algorithm is non-proprietary and is implemented
in open source software. Version 1 of the InChI is
currently in use.

A disadvantage of the InChI is that its length increases
with molecular size and level of structural detail. Also, it
includes non-alphabetical characters such as /, \, - and +.
These features make the InChI inconvenient for internet
and database searches [20]. A hashed version, the
InChIKey, was therefore created [20]. The InChIKey has
a fixed length of 27 characters, and only includes uppercase
English letters and dashes. These features also make
it a good choice as a database independent identifier for
metabolites in metabolic reconstructions. Most chemical
databases now include an InChIKey (as well as an InChI
and SMILES string) in each database entry. The hashing
algorithm that generates InChIKeys from InChIs is not
reversible [20], meaning that there is no algorithm that
can convert an InChIKey back to an InChI. InChIKeys are
therefore not directly convertible to metabolite structure.
To retrieve an InChI from an InChIKey it is necessary to
use a lookup table or a chemical structure resolver such as
the Chemical Identifier Resolver [21] or ChemSpider [22].
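
The fixed layout of the InChIKey makes it easy to screen annotations for malformed keys before they are used for mapping. The following is a minimal sketch in Python, not part of any of the applications discussed here. It assumes the usual InChIKey layout of a 14-letter block, a 10-letter block whose ninth letter is "S" for standard keys, and a single final letter; the function names are hypothetical.

```python
import re

# Assumed layout: 14 letters, dash, 10 letters, dash, 1 letter
# (14 + 1 + 10 + 1 + 1 = 27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """Check only the format of a key; this cannot tell whether the
    key corresponds to a real compound."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def is_standard_inchikey(text: str) -> bool:
    """A standard InChIKey carries the flag letter 'S' at position 24
    (index 23); non-standard keys carry 'N' there."""
    key = text.strip()
    return looks_like_inchikey(key) and key[23] == "S"


example = "WQZGKKKJIJFFOK-GASJEMHNSA-N"
print(len(example), looks_like_inchikey(example), is_standard_inchikey(example))
```

A check of this kind only catches truncated or mistyped keys. Because the hash is not reversible, confirming that a key matches a given structure still requires regenerating it from the InChI.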

Genome-scale reconstructions usually include hun- 
dreds of metabolites. The latest human reconstruction, 
Recon 2, includes over 2500 metabolites [23]. Manual 
annotation of such a large number of metabolites, with 
multiple identifiers each, is extremely laborious. Metabo- 
lites in early reconstructions were generally annotated 
manually. Today, reconstruction tools such as the Model 
SEED [24], rBioNet [25] and the SuBliMinaL Toolbox 
[26] facilitate the process by populating new recon- 
structions with pre-annotated components from source 
databases. Metabolites in the source databases, however, 
may have incomplete or incorrect annotations, which 
will then be propagated to all new reconstructions. 
Metabolites in existing reconstructions may likewise have 
incomplete or incorrect annotations. Metabolite annotations
in metabolic reconstructions may therefore need to
be updated prior to their use. Software applications that 
enable automatic updating of annotations are desirable. 

The SuBliMinaL Toolbox comes with an annotation 
module that can be used to retrieve ChEBI identifiers 
for metabolites by searching the ChEBI database with 
metabolite names. A recently published annotation tool 
called Metingear [27] also features a name search for 
database identifiers, but against multiple databases. In 
our experience, metabolite names are poor identifiers. 
Most metabolites have several synonyms and databases 
differ in the use of those synonyms. Searching a database 
with a metabolite name may therefore yield no results if 
the metabolite is registered under a different synonym. 
The same name is also often associated with multiple 
stereoisomers of the same compound, across different 
databases and within the same database. The name dex- 
trose, for example, is associated with four entries in 
PubChem Compound: D-glucose (PubChem Compound 
ID (CID) 5793), a-D-glucose (PubChem CID 79025), p- 
D-glucose (PubChem CID 64689) and a generic hexopy- 
ranose (PubChem CID 206). As the last entry (PubChem 
CID 206) demonstrates, names are also sometimes asso- 
ciated with the incorrect structures. Numerous other 
examples of incorrect associations between names and 
structures are given in [28]. A name search can therefore 
yield a list of several candidate identifiers that must be 
sorted through manually to find the one that best matches 
the target compound. It is therefore not conducive to 
automatic updating of identifiers. 

A name search is the best option available for updat- 
ing identifiers for metabolites that are only annotated with 
a name. However, most metabolites in source databases 
for metabolic reconstruction tools are annotated with 
at least one identifier besides name. The same goes 
for metabolites in existing metabolic reconstructions, 
especially those that were built according to the afore- 
mentioned protocol [2]. The non-name identifiers are 
generally more specific than names as they refer to spe- 
cific structures or database entries. Software applications 
that enable mapping between non-name identifiers could 
therefore facilitate automatic updating of metabolite iden- 
tifiers in metabolic network reconstructions. 

The problem of annotating large sets of metabolites 
is well known in metabolomics and chemical informat- 
ics. Applications that can be used to partially automate 
annotation have been developed for these fields. We 
searched among these for applications that were suitable 
for mapping between metabolite identifiers in metabolic 
reconstructions. We only considered open-source appli- 
cations as these can readily be adapted to the needs of 
the metabolic reconstruction community and integrated 
into metabolic reconstruction tools. Three applications
that met these criteria were MetMask [29], the Chemical
Translation System (CTS) [30] and UniChem [31].
These applications implement annotation strategies that
go beyond name search. They enable mapping between
multiple types of identifiers, including chemical names.
Different annotation strategies are implemented in each
of the three applications. Here, we compare these applications
to determine which annotation strategy is best
suited for annotation of metabolites in genome-scale
metabolic reconstructions. We then apply the top application
to update annotations of metabolites in the latest
human reconstruction, Recon 2 [23].

Applications 
MetMask 

MetMask [29] is a desktop application for creating and
querying custom local databases of identifier groups
or "metabolite masks". Identifier groups from multiple
sources, such as public databases and private chemical
libraries, can be imported into the same MetMask
database. We imported identifier groups from Recon 2,
HMDB and ChEBI. MetMask merges groups that are
deemed compatible by the application's heuristics. A
MetMask database can be queried with an identifier of one
type (e.g., synonym) to find other identifiers of either the
same or a different type (e.g., InChIKey) that belong to the
same mask. MetMask is available from
http://metmask.sourceforge.net/.
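
The idea of merging identifier groups into masks can be illustrated with a short Python sketch. It is a deliberate simplification: here two groups are merged whenever they share any identifier, whereas MetMask applies its own compatibility heuristics, which are not reproduced. The function names and the two ChEBI values are hypothetical placeholders; the KEGG and HMDB identifiers are those of the lactose entries in the two databases.

```python
from collections import defaultdict


def merge_groups(groups):
    """Merge identifier groups that share at least one (type, value) pair.

    Simplifying assumption: any shared identifier triggers a merge.
    """
    parent = {}

    def find(item):
        # Follow parent links to the representative, shortening the path.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for group in groups:
        items = list(group)
        for item in items:
            parent.setdefault(item, item)
        for item in items[1:]:
            parent[find(item)] = find(items[0])

    masks = defaultdict(set)
    for item in parent:
        masks[find(item)].add(item)
    return list(masks.values())


def query(masks, identifier, output_type):
    """Return all identifiers of output_type in the mask(s) holding identifier."""
    hits = set()
    for mask in masks:
        if identifier in mask:
            hits |= {value for kind, value in mask if kind == output_type}
    return hits


groups = [
    {("kegg", "C00243"), ("name", "lactose"), ("chebi", "CHEBI:placeholder-1")},
    {("hmdb", "HMDB00186"), ("name", "lactose"), ("chebi", "CHEBI:placeholder-2")},
]
masks = merge_groups(groups)
print(query(masks, ("kegg", "C00243"), "chebi"))
print(query(masks, ("hmdb", "HMDB00186"), "chebi"))
```

Both queries return the same two ChEBI placeholders, because the shared name places all identifiers in one mask. Under this model, queries with different identifiers for the same compound cannot give independent results.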

The Chemical Translation System 

CTS [30] is a web application for mapping between
chemical identifiers. It covers 215 types of identifiers,
including chemical names, structure-based identifiers,
and database identifiers. Queries are sent to a single
database where data from multiple external databases has
been aggregated. Identifiers are matched based on "standard"
InChIKeys, which are generated from "standard"
InChIs, i.e., InChIs produced with standard options settings.
Standard InChIs and InChIKeys are not tautomer
specific. CTS finds all standard InChIKeys that are linked
to an input identifier and returns all identifiers of the
requested output type(s) that are linked to the same standard
InChIKeys. Web services and a web user interface for
CTS are available at http://cts.fiehnlab.ucdavis.edu.
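
The two-step lookup described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `links`, `map_identifier` and the keys `KEY-GENERIC` and `KEY-ALPHA` are hypothetical stand-ins, not part of CTS or its web services. The toy data uses the KEGG Compound and HMDB entries for lactose, which are linked to different structures.

```python
def map_identifier(links, source, output_type):
    """Return identifiers of output_type that share a key with source.

    links is a list of ((identifier_type, value), inchikey) pairs.
    """
    # Step 1: collect every standard InChIKey linked to the input identifier.
    keys = {key for identifier, key in links if identifier == source}
    # Step 2: collect identifiers of the requested type linked to those keys.
    return {
        value
        for (kind, value), key in links
        if key in keys and kind == output_type
    }


# Placeholder keys stand in for the real standard InChIKeys.
links = [
    (("kegg", "C00243"), "KEY-GENERIC"),
    (("hmdb", "HMDB00186"), "KEY-ALPHA"),
    (("name", "lactose"), "KEY-GENERIC"),
    (("name", "lactose"), "KEY-ALPHA"),
]

print(map_identifier(links, ("kegg", "C00243"), "hmdb"))
print(map_identifier(links, ("name", "lactose"), "hmdb"))
```

The first query returns nothing because the two database entries do not share a key, whereas the name is linked to both keys and therefore reaches both entries. The same mechanism explains why name queries tend to return longer candidate lists.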

UniChem 

UniChem [31] is a web application that was designed
for automatic generation of cross-references between different
databases, but can also be used to map between
chemical identifiers. It is similar to CTS in that identifiers
are matched based on standard InChIs and InChIKeys.
Queries to UniChem are also sent to a single database
where data from multiple external databases has been
aggregated.
There are two major differences between UniChem and
CTS that are relevant to metabolite annotation. The first
is that UniChem covers far fewer identifier
types. At the time of writing, it covered identifiers
in 18 public databases in addition to standard InChIs
and InChIKeys. Covered databases included KEGG Compound,
ChEBI and HMDB, but not PubChem Compound.
Chemical names were not covered. The second major difference
between UniChem and CTS is that data from
external databases must pass various quality checks before
they are imported into the UniChem database. External
database entries are generally required to include at least
a database identifier and a standard InChI to be included
in the UniChem database. The quality checks performed
by UniChem include checking whether a standard InChI
in an entry can be converted to a standard InChIKey. If
the entry also includes a standard InChIKey, it is checked
against the standard InChIKey generated from the standard
InChI. Entries that fail these checks are excluded
from the UniChem database as either the standard InChI,
standard InChIKey, or both can then be assumed to be
invalid. In addition to these quality checks, UniChem
keeps track of which database identifiers are currently
associated with a given InChI and which identifiers were
associated with that InChI in the past. It does not output
obsolete associations unless the user requests them.
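
The import checks described above can be paraphrased as a small Python function. This is a sketch under assumptions, not UniChem's implementation: the entry layout (a dictionary with `id`, `inchi` and an optional `inchikey`) and the function name are hypothetical, and `inchi_to_key` is a caller-supplied converter, for example a wrapper around the InChI software, that returns `None` when no key can be generated.

```python
def passes_quality_check(entry, inchi_to_key):
    """Decide whether a database entry would be accepted for import.

    entry: dict with 'id', 'inchi' and optionally 'inchikey'.
    inchi_to_key: callable mapping a standard InChI to a standard
    InChIKey, or to None if the InChI cannot be converted.
    """
    inchi = entry.get("inchi")
    # An entry needs at least a database identifier and a standard InChI.
    if not entry.get("id") or not inchi:
        return False
    # The InChI must be convertible to a standard InChIKey.
    generated = inchi_to_key(inchi)
    if not generated:
        return False
    # A supplied InChIKey must agree with the generated one.
    supplied = entry.get("inchikey")
    return supplied is None or supplied == generated


# Toy converter: a lookup table standing in for a real key generator.
toy_converter = {"InChI=1S/demo": "DEMO-KEY"}.get

entry = {"id": "X1", "inchi": "InChI=1S/demo", "inchikey": "OTHER-KEY"}
print(passes_quality_check(entry, toy_converter))
```

The example entry is rejected because its supplied key disagrees with the generated one, which is the situation in which either the InChI, the InChIKey, or both must be assumed invalid.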

These differences between UniChem and CTS stem 
from the fact that they were designed for different pur- 
poses. CTS was designed for metabolite annotation and 
emphasizes coverage, whereas UniChem was designed 
for automatic database cross-referencing and emphasizes 
specificity. We include UniChem here to assess the value 
of UniChem-like quality checks in metabolite annotation. 
Web services and a web user interface for UniChem are 
available at https://www.ebi.ac.uk/unichem/. 

Results 

Identifier mapping tests 

The tests described in Methods revealed a number of 
differences between the three mapping applications. The 
overall performance of each application is quantified in 
Table 1. Performance on individual mapping tests is given 
in Additional file 1: Table S1. UniChem performed best on
tests involving identifier types that it covered. For each
input identifier that was associated with at least one
output identifier in the UniChem database, UniChem
generally returned only the preferred output identifier. Many input
identifiers, however, were not associated with any output
identifier. This was also the case for CTS, which indicates
that it is a characteristic of InChI-based mapping strategies.
The fact that no output identifier of a particular type
is returned for a given input identifier does not necessar- 
ily mean that the input identifier is not in the database. 
It only means that no identifier of the requested output 



Table 1 Quantified overall performance of the three mapping applications

                      Mean counts
             In    Hits    Out    Matches    Mean score
MetMask      99     98     146      93          0.63
CTS          99     80     105      75          0.57
UniChem      99     74      77      72          0.70

Scores were calculated as described in Section Scoring. Means were taken over
all input-output pairs that were covered by each application. The means reflect
well general characteristics of each application observed in individual tests (see
Additional file 1: Table S1).



type is associated with the exact same standard InChI. A 
single compound can have more than one valid structure, 
each represented with a distinct standard InChI. The dis- 
tinct, equally valid structures are usually stereoisomers. If 
two databases associate different stereoisomers with their 
respective entries for the same compound, the two entries 
cannot be linked through InChIs. The KEGG Compound
and HMDB entries for lactose are good examples of this
(see Figure 1). Because the lactose stereoisomers in the
two entries have different InChIs, neither UniChem nor
CTS can map between them.

On average, a slightly higher number of input identifiers 
were associated with at least one output identifier with 
CTS than with UniChem. CTS, however, also returned 
a higher number of non-preferred identifiers, and thus 
received a lower overall score (Table 1). These differences 
between CTS and UniChem are attributable to two factors:
the greater number of identifier types covered by
CTS, and the checks implemented by UniChem to prevent
errors in its database (see Section Applications).
The fact that CTS covers chemical names as an identifier
type has a particularly large effect. Chemical names
are ambiguous identifiers and the same name can be associated
with a number of different structures [28]. Using
names as input identifiers will therefore often result in a
long list of candidate output identifiers (see Additional file
1: Table S1). Referring to the previous example, the name
lactose is associated with different structures in KEGG 
Compound and HMDB. Inputting the name lactose into 
CTS will return all identifiers that are associated with 
either structure. The number of incorrect output iden- 
tifiers returned for names is also generally higher than 
for other input types, because names are more frequently 
associated with an incorrect structure [28]. When chemi- 
cal name is included as an input identifier type (Figure 2a), 
the number of incorrect identifiers returned by CTS is 
much greater. When only identifiers that are covered by 
UniChem are considered (Figure 2b), CTS results are 
more similar to those of UniChem. The remaining differ- 
ence between the two applications is presumably due to 
the quality checks implemented by UniChem. 
(a) Lactose from KEGG

(b) Lactose from HMDB

Figure 1 Lactose stereoisomers. Two epimers of lactose occur in
nature, α-lactose and β-lactose. The epimers differ by the
configuration of structural groups around a single stereogenic carbon
atom (top right). (a) In KEGG Compound the synonyms lactose and
milk sugar are assigned to a generic stereoisomer, where the
configuration around this stereogenic carbon is not specified
(C00243). Reactions, enzymes and pathways involving lactose are
linked to this entry in KEGG. (b) The same synonyms and most
lactose-related data are linked to the α-epimer in HMDB
(HMDB00186). There is neither an entry for the generic stereoisomer
in HMDB, nor an entry for the α-epimer in KEGG Compound.



MetMask returned the greatest number of preferred
identifiers on average, but it also returned the greatest
number of non-preferred identifiers. It received a score
in between those of UniChem and CTS (Table 1).
MetMask returned the greatest number of output identifiers
on individual mapping tests but when all tests were combined,
MetMask returned fewer unique identifiers than
CTS (Figure 2). The reason for this is that MetMask generally
returns the same set of output identifiers when it
is queried with different input identifiers for the same
compound. MetMask will, for example, return the same
set of ChEBI IDs whether it is queried with the KEGG
Compound ID (CID) for lactose or the HMDB ID. The
ChEBI IDs for both lactose stereoisomers in Figure 1 will be
returned for either query. This is a consequence of the way
metabolite masks are defined. All identifiers for lactose
belong to the same mask. Querying MetMask with any
identifier in a particular mask will always yield all identifiers
of the requested output type that are in the same
mask. Queries with different types of identifiers for the
same compounds are therefore not independent. This also
explains why the difference between the total number of
unique identifiers returned by MetMask when all tests are
included (Figure 2a) and when only UniChem-compatible
tests are included (Figure 2b) is so much smaller than for
CTS.

Taken together, these results demonstrate that regard- 
less of the choice of mapping application, some manual 
effort is required for proper annotation of metabolites 
with database identifiers. With MetMask this effort is 
mainly directed at sorting through candidate output iden- 
tifiers to locate the preferred ones. With UniChem it is 
directed at searching for missing identifiers. Some effort 
is required for both sorting and gap filling with CTS, 
although less for each than with the other two applica- 
tions. In the next section we seek ways to minimize this 
manual effort. 




Figure 2 Identifiers output in identifier mapping tests. Annotations of unique identifiers returned by each application, (a) when all mapping 
tests are included, and (b) when only tests involving identifier types covered by UniChem are included. The output identifiers returned in all 
included tests were pooled and duplicates removed. If the same identifier was returned in more than one test it was only counted once. The 
annotations are explained in Section Scoring. 
Optimization of performance

In each individual mapping test described in the pre- 
vious section, a single identifier was used as input for 
each compound. Metabolites in metabolic reconstruc- 
tions, however, are often annotated with more than one 
identifier, e.g., a name and an HMDB ID. Using all avail- 
able identifiers to map to missing ones should be more 
powerful than using only one at a time. If one identifier for 
a given compound does not map to any identifier of the 
requested output type, a different identifier for the same 
compound may be used to fill that gap. If multiple identi- 
fiers of different types map to the same output identifier, 
our confidence in that output identifier is increased. Such 
additive mapping of identifiers is only useful, however, if 
the different input identifiers do not all yield the same 
outputs, i.e., if queries with different input identifiers are 
independent. As discussed in the previous section, this is 
not the case for MetMask. We therefore only tested this 
method on CTS and UniChem. 

We investigated the effects of combining outputs for two
to six different types of input identifiers with CTS and two
to four different types of input identifiers with UniChem.
The order in which input identifiers were added was
determined by the number of identifiers of each type already
available in Recon 2. All metabolites in Recon 2 have
a name, so we initially used only names. In the second
iteration, we combined outputs for names and standard
InChIKeys, since standard InChIKeys were the second
most common identifier type in Recon 2. In each subsequent
iteration, we added the next-most-common type of
input identifier until outputs for all input types had been
combined. After each iteration, we assigned a confidence
score to each returned output identifier; the score was increased
each time that same output identifier was returned. For
each metabolite, we only retained output identifiers with
the highest confidence score out of all identifiers returned
for that same metabolite. The confidence score assigned
to each output identifier was increased by 0.5 if the
identifier was returned with name as the input type and
by 1 otherwise. Our confidence in identifiers returned for
names was lower than for other identifiers because, as discussed
above, the number of non-preferred and incorrect
identifiers returned for names was higher.
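
The scoring scheme for additive mapping can be written down compactly. The Python sketch below follows the rules stated above (0.5 per hit from a name query, 1 per hit from any other input type, retain only the top-scoring outputs); the function name, the data layout and the candidate identifiers `C1` and `C2` are hypothetical and are not taken from the scripts used in this work.

```python
from collections import defaultdict

NAME_WEIGHT = 0.5   # lower confidence in outputs returned for names
OTHER_WEIGHT = 1.0  # confidence added by any non-name input identifier


def additive_mapping(results_by_input_type):
    """Combine outputs for one metabolite and one output identifier type.

    results_by_input_type maps an input identifier type (e.g. 'name',
    'inchikey') to the output identifiers returned for that input.
    Returns (retained identifiers, all confidence scores).
    """
    scores = defaultdict(float)
    for input_type, outputs in results_by_input_type.items():
        weight = NAME_WEIGHT if input_type == "name" else OTHER_WEIGHT
        for output in set(outputs):  # count each output once per input type
            scores[output] += weight
    if not scores:
        return set(), {}
    best = max(scores.values())
    retained = {output for output, score in scores.items() if score == best}
    return retained, dict(scores)


# Hypothetical candidates for a single metabolite.
retained, scores = additive_mapping({
    "name": ["C1", "C2"],
    "inchikey": ["C1"],
    "chebi": [],
})
print(retained, scores)
```

Here the name query alone would leave two tied candidates, and the InChIKey query resolves the tie in favour of `C1`. If all inputs return the same outputs, as with MetMask, adding inputs changes the scores but not the ranking.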

The overall results of each iteration of the additive mapping
tests are quantified in Table 2. The mean score for
CTS was significantly higher on these tests than on any
individual mapping test (Table 1). In fact, CTS scored
higher on the additive mapping tests than any of the three
applications did on individual mapping tests, even when
outputs for only two types of input identifiers (names
and standard InChIKeys) were combined. The mean score
for UniChem also increased as input identifiers were
added, although less than for CTS. When multiple input
identifiers were combined, CTS and UniChem received 



Table 2 Quantified overall performance of CTS and
UniChem on the additive identifier mapping tests

                                  Mean counts
                         In     Hits    Out    Matches    Mean score
Name only    CTS         100     67     141      63          0.30
             UniChem     NA      NA     NA       NA          NA
+ InChIKey   CTS         100     95     112      87          0.
             UniChem     100     80      83      78          0.75
+ ChEBI      CTS         100     96     118      93          0.75
             UniChem     100     84      88      82          0.78
+ HMDB       CTS         100     96     113      89          0.75
             UniChem     100     85      88      82          0.79
+ KEGG       CTS         100     97     116      93          0.78
             UniChem     100     87      91      85          0.82
+ PubChem    CTS         100     97     117      94          0.78
             UniChem     NA      NA     NA       NA          NA

Results of individual tests are given in Additional file 1: Table S2. NA implies that
the input identifier type was not covered by the corresponding application.
PubChem refers to the PubChem Compound database and KEGG to the KEGG
Compound database.



similar scores, but for different reasons. UniChem con- 
tinued to return only the preferred output identifier for 
most metabolites but, even with four input identifier types 
combined, it did not return an output identifier for all 
metabolites. CTS, on the other hand, returned at least one 
candidate output identifier for most metabolites. The pre- 
ferred identifier was generally amongst those candidates 
but several non-preferred identifiers were also returned. 
In our opinion, the results of combining input identi- 
fiers were qualitatively better for CTS than for UniChem. 
Despite the fact that some manual effort is required to sort 
through candidate output identifiers returned by CTS, it 
is reassuring to know that the output is relatively com- 
prehensive. Otherwise, it would be necessary to search 
manually for identifiers that may not even exist. Accepting 
and rejecting suggested identifiers is fast by compari- 
son. In the following section, we therefore use additive 
mapping with CTS to update metabolite annotations in 
Recon 2 [23]. 

Update of Recon 2 metabolite annotations 

Recon 2 includes 2,626 unique metabolites. During the
reconstruction of Recon 2, 1,690 metabolites were annotated
with a standard InChIKey, 1,125 with a ChEBI ID,
1,040 with an HMDB ID, 396 with a KEGG CID, and
150 with a PubChem CID. All metabolites were annotated
with a metabolite name. We updated metabolite
identifiers in Recon 2 using additive mapping with CTS
as described in the previous section. We used CTS both
to review existing annotations and to add as many new
ones as possible. In addition to updating identifier types
that were already present in Recon 2, we added identifiers
for the LIPID MAPS Structure Database (LMSD)
[32], which were previously missing. To speed up the
process of updating identifiers, we took advantage of
the extensive metabolite annotations that are available in
HMDB [14,15]. The step-by-step process is described in
Additional file 1: Section 2. We added a total of 3,049
new identifiers to Recon 2, removed 124 incorrect identifiers,
and replaced 569 identifiers. We therefore updated
a total of 3,746 identifiers. All except 233 identifiers
were updated automatically. These 233 identifiers were
selected manually from a list of 2,790 candidates that were
returned by CTS. Manually sorting through the list of candidates
required approximately ten man-hours. A random
sample of 100 automatically updated identifiers was also
checked manually. All 100 were found to be correct. The
full list of updated metabolite annotations in Recon 2 is
included in Additional file 2.

The majority (1,962/2,660) of added identifiers were
KEGG and PubChem CIDs (Figure 3a), which were previously
lacking in Recon 2. We also added LMSD IDs for 389
metabolites. Around half (65/124) of all incorrect identifiers
were PubChem CIDs. The remaining half was evenly
distributed among ChEBI IDs, HMDB IDs and KEGG CIDs.
Incorrect identifiers were identified and removed automatically
(see Additional file 1: Section 2). The majority
of replaced identifiers (464/569) were ChEBI IDs. In most
cases, we replaced a ChEBI ID for a charged metabolite
with a ChEBI ID for the same metabolite in its neutral
state. ChEBI and PubChem Compound often include
separate entries for metabolites in neutral and various
charged states, but HMDB, KEGG Compound and LMSD
usually only include metabolites in their neutral state.
For the sake of mapping and other comparisons between
databases, it is therefore preferable to include identifiers
for metabolites in their neutral states in metabolic
reconstructions. If metabolite charge is required, it can
be predicted with software tools, such as ChemAxon's
Calculator Plugins (ChemAxon Kft., Budapest, Hungary).

Most of the added identifiers (2,991/3,049) were for 
metabolites that were already annotated with at least one 
identifier besides a chemical name in Recon 2 (Figure 3b). 
Non-name identifiers were added for 28 metabolites that 
were previously only annotated with a name. That leaves 
594 Recon 2 metabolites with name as the only annota- 
tion. There are two possible reasons for this; either these 
metabolites are not included in the CTS database, or the 
synonyms used for them in Recon 2 is not included. The 
CTS name search is currently only capable of match- 
ing names exactly, so even slight differences between a 
metabolite name in Recon 2 and the synonyms listed for 
that metabolite in CTS would prevent finding a match. 
A majority (432/594) of the metabolites for which no 
identifier was found consisted of macromolecules and 
metabolites with variable structures (i.e., R groups), such 
as polysaccharides and proteins. Such metabolites are sel- 
dom included in the chemical databases considered here, 
so it is not surprising that no identifiers were found. In 
addition, metabolites with variable structures cannot be 
represented with an InChI. An InChI based application 
such as CTS therefore cannot cover them. The remaining 
metabolites (162/594) are more likely to be included in the 
databases considered here. If they are, they must be regis- 
tered under different synonyms than the names included 
in Recon 2. Non-name identifiers for these metabolites 
will need to be searched for manually. 

Figure 3 Recon 2 identifiers. Identifier statistics for Recon 2 before and after metabolite annotations were updated with CTS. (a) Number of 
unique metabolites with each of the seven types of identifiers, n: names, i: InChIKeys, c: ChEBI ID, h: HMDB ID, k: KEGG CID, p: PubChem CID, l: 
LipidMAPS ID. (b) Number of unique metabolites with one, and up to seven, identifiers each. 

Discussion 

The three applications compared in this work implement 
two different metabolite annotation strategies. MetMask 
implements what can be termed a "metabolite masking" 
strategy (see Section MetMask), whereas CTS and 
UniChem implement two different versions of an InChI 
based strategy (see Sections The Chemical Translation 
System and UniChem). We found the InChI based strategy 
to be better suited for annotation of metabolites 
in genome-scale metabolic reconstructions. The main 
advantage of this strategy over the metabolite masking 
strategy is that multiple types of information about a 
metabolite can be used to perform independent searches 
for missing annotations. The independent search results 
can then be combined to increase both coverage and 
specificity. Candidate annotations are thereby found for 
a greater number of metabolites (increased coverage), 
while candidate annotations for each metabolite are fewer 
and usually include the preferred annotations (increased 
specificity). Candidate annotations, which are returned 
by multiple independent searches, can additionally be 
assigned in an automatic manner with more confidence. 
Neither the CTS nor UniChem user interfaces currently 
offer the possibility of combining multiple types of 
information to search for missing annotations. Here we 
performed each search separately and combined results 
afterwards. This process was slow and required a consid- 
erable amount of programming. An InChI based applica- 
tion that allows simultaneous input of multiple types of 
metabolite information would greatly simplify and accel- 
erate annotation. 
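
The combination step described above can be sketched in Python as a simple vote count over independent search results. This is a minimal sketch, not the procedure used in this work: the function name `combine_candidates`, the dictionary layout and the placeholder identifiers `cid_A` to `cid_C` are assumptions made for illustration.

```python
from collections import Counter


def combine_candidates(search_results):
    """Merge candidate identifiers returned by independent searches.

    search_results maps an input identifier type (e.g. "KEGG CID") to the
    set of candidate output identifiers found using that input alone.
    Returns (candidate, support) pairs, best supported first.
    """
    support = Counter()
    for candidates in search_results.values():
        # Each independent search votes at most once per candidate.
        support.update(set(candidates))
    # Sort by descending support, then alphabetically for a stable order.
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results of three searches for one metabolite's PubChem CID.
results = {
    "name": {"cid_A", "cid_B", "cid_C"},
    "KEGG CID": {"cid_A"},
    "HMDB ID": {"cid_A", "cid_B"},
}
ranked = combine_candidates(results)
```

In this toy case the union of the three searches gives coverage, while the candidate returned by all three searches (`cid_A`) is the one that could be assigned with the most confidence.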

When multiple independent search results were com- 
bined, the InChI based strategy implemented in CTS 
gave qualitatively better results than the version imple- 
mented in UniChem. Although UniChem gave more specific 
results, CTS covered a greater number of metabolites. 
The main advantage of CTS over UniChem is that it 
can map between a greater number of identifier types. 
We chose to map between a limited number of identifier 
types that are relevant for the human metabolic reconstruction 
Recon 2, but the same strategy could be used to 
map between any of the 215 types of identifiers covered by 
CTS. 

The greater metabolite coverage of CTS was mostly due 
to the fact that CTS allows chemical names as inputs. 
This was also the main reason for the lower specificity of 
CTS results. Chemical names are rather generic metabo- 
lite identifiers, or at least they are used rather generically 
in chemical databases. The same name is often associated 
with multiple different structures, sometimes incorrectly 
[28]. When names are input to an InChI based map- 
ping application, identifiers for different but equally valid 
structures may be returned leading to increased cov- 
erage. However, identifiers for nonpreferred, invalid or 
even incorrect structures may also be returned leading to 
reduced specificity. Inputting the name lactose to CTS, for 
example, will return both a KEGG CID and an HMDB ID, 
for different but equally valid lactose stereoisomers (see 
Figure 1). However, it will also return a total of four Pub- 
Chem CID, one of which is invalid as it refers to a generic 
disaccharide (PubChem CID 294). To retain the coverage 
obtained with chemical names as inputs, while minimizing 
the adverse effects it has on specificity, we introduced 
a confidence score that gave annotations returned for 
names a lower priority. A similar mechanism could be 
built into the metabolite annotation application suggested 
above, where multiple types of metabolite information 
could be input simultaneously as search criteria. 
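
A confidence score of this kind might look like the following Python sketch. The weights, the threshold and the function names are hypothetical and chosen only to illustrate the idea that name-derived candidates count for less; they are not the values or the scoring scheme used in this work.

```python
# Hypothetical weights: database- and structure-derived inputs are trusted
# more than chemical names. The values are illustrative only.
INPUT_WEIGHTS = {
    "InChIKey": 1.0,
    "ChEBI ID": 1.0,
    "HMDB ID": 1.0,
    "KEGG CID": 1.0,
    "name": 0.5,
}


def confidence_scores(search_results, weights=INPUT_WEIGHTS):
    """Sum the weights of all input types that returned each candidate."""
    scores = {}
    for input_type, candidates in search_results.items():
        for candidate in set(candidates):
            scores[candidate] = scores.get(candidate, 0.0) + weights[input_type]
    return scores


def auto_assign(scores, threshold=1.5):
    """Return the only candidate reaching the threshold, else None.

    None means the metabolite is left for manual review.
    """
    accepted = [c for c, s in scores.items() if s >= threshold]
    return accepted[0] if len(accepted) == 1 else None


# Placeholder candidates: the name search is less specific than the KEGG one.
results = {"name": {"cid_A", "cid_B"}, "KEGG CID": {"cid_A"}}
scores = confidence_scores(results)
assigned = auto_assign(scores)
```

With these assumed weights, a candidate returned only by a name search never reaches the threshold on its own, so it can still be listed for review without being assigned automatically.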

Although the fact that CTS allows names as inputs 
explains most of the difference between the specificity 
of CTS and UniChem, it does not explain all of it (see 
Figure 2 and Additional file 1: Table S1). Some of this difference 
is also due to the quality checks performed by 
UniChem before data from external databases is imported 
into the UniChem database (see Section UniChem). Any 
InChI based application would benefit from similar quality 
checks. A recent study [33] showed that different structural 
representations (Molfiles, InChIs, SMILES) within 
the same database entry often do not represent the same 
structure. Such mismatches are indicative of errors in 
database entries. Quality checks such as the ones imple- 
mented in UniChem hinder such errors from being prop- 
agated to local databases for annotation applications. 
Additional quality checks could also be performed, such 
as checking whether the two dimensional structure (e.g., 
in Molfile format) and the chemical formula in an exter- 
nal database entry match the standard InChI. Chemical 
formulas can also be used to check candidate annota- 
tions returned for metabolites and to weed out incorrect 
ones. All metabolites in metabolic network reconstruc- 
tions should be annotated with their chemical formu- 
las. We used metabolite formulas in Recon 2 to review 
database identifiers that were added to the reconstruc- 
tion (see Additional file 1: Section 2). If the metabolite 
formula in the database entry associated with an identi- 
fier did not match the formula in Recon 2, we assumed 
the identifier was incorrect and rejected it. Differences 
in numbers of hydrogen atoms between formulas were 
ignored. The metabolite annotation application suggested 
above could include metabolite formulas as one type of 
input information about metabolites. Candidate identi- 
fiers associated with different formulas than the input for- 
mula would then be rejected before they were added to the 
reconstruction. 
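
The formula check described above can be illustrated with a short Python sketch. It assumes simple formulas without brackets, charges or isotopes, and the helper names `parse_formula` and `formulas_match` are hypothetical; it is not the code used for the review in Additional file 1.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C12H22O11'.

    Assumes element symbols followed by optional counts, with no
    brackets, charges or isotope labels.
    """
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def formulas_match(formula_a, formula_b):
    """Compare two formulas while ignoring the number of hydrogen atoms."""
    a = parse_formula(formula_a)
    b = parse_formula(formula_b)
    a.pop("H", None)
    b.pop("H", None)
    return a == b


# Lactic acid versus lactate differ only in hydrogen count, so they match.
same_compound = formulas_match("C3H6O3", "C3H5O3")
# Glucose versus lactose differ in carbon and oxygen, so they do not.
different_compound = formulas_match("C6H12O6", "C12H22O11")
```

Ignoring hydrogens makes the comparison tolerant of the neutral and charged forms discussed earlier, at the cost of not distinguishing compounds that differ only in their degree of saturation.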

The coverage of any InChI based application is lim- 
ited to metabolites with defined structures that can be 
represented with InChIs. Metabolic reconstructions often 
include generic metabolites with undefined structural ele- 
ments such as R groups. These generic metabolites rep- 
resent whole classes of structurally similar metabolites 
that undergo the same metabolic transformations in vivo. 
They are introduced into reconstructions for simplifi- 
cation. Such generic metabolites cannot be represented 
with an InChI and therefore cannot be covered by an 
InChI based metabolite annotation application. An anno- 
tation application based on a metabolite masking strategy 
would be better suited to mapping between identifiers for 
such metabolites. The fact that an InChI based application 
cannot cover generic metabolites of this type does 
not decrease its value much, since these metabolites are 
generally a minority of all metabolites in metabolic recon- 
structions and only a minority of those is expected to be 
included in chemical databases of interest. 

Generic metabolites of a different type that are also 
found in metabolic reconstructions are generic stereoiso- 
mers, i.e., metabolites with undefined stereochemistry at 
one or more stereocenters. An example found in Recon 2 
is lactose in Figure 1a. As enzymes that catalyze metabolic 
reactions involving lactose generally do not have known 
stereospecificity, both the α- and β-epimers are assumed 
to participate in the same reactions. Instead of needlessly 
complicating the reconstruction by writing the same reactions 
twice, the two epimers are collapsed into a single 
generic stereoisomer. The metabolic component lactose 
in Recon 2 therefore encompasses both α- and β-lactose. 
Similar examples are frequent in fatty acid metabolism 
where cis-trans isomerism is not specified unless neces- 
sary, i.e., for fatty acids that participate in reactions that 
are catalyzed by enzymes with known stereospecificity. 

Generic stereoisomers can be represented with an 
InChI and can therefore be covered by an InChI based 
application. CTS often returned several candidate iden- 
tifiers for such metabolites that needed to be sorted 
through manually to select the preferred ones. One rea- 
son for this is that the names of generic stereoisomers 
in Recon 2 are often associated with more specific or 
even more generic (and thus invalid) stereoisomers in the 
databases considered here. As discussed above, the name 
lactose is associated with α-lactose (more specific) in 
HMDB and a disaccharide with no specified stereochemistry 
(more generic) in PubChem among other database 
identifiers. Another reason why CTS often returned 
several candidate identifiers for generic stereoisomers was 
that preexisting annotations of these metabolites in Recon 
2 were sometimes rather ambiguous. Lactose in Recon 
2, for example, was annotated with the KEGG CID for 
the generic stereoisomer (Figure 1a) and the HMDB ID 
for the α-epimer (Figure 1b). When these two identifiers 
were used in combination to find the PubChem CID for 
lactose, CTS naturally returned PubChem CID for both 
stereoisomers. This raises the question of how generic 
stereoisomers should be annotated in metabolic recon- 
structions. The general rule should be to annotate each 
metabolite with the most generic identifier of each type 
that is still valid. Lactose therefore should be annotated 
with identifiers for the generic stereoisomer in Figure 1a. 
However, there is no entry for this lactose stereoisomer 
in HMDB. Instead, HMDB includes separate entries for 
the more specific α- and β-epimers of lactose. In such 
cases, the general rule in the past appears to have been 
to select the identifier for the most prevalent specific 
stereoisomer, e.g., the HMDB ID for α-lactose, as relevant 
biochemical data is more likely to be associated 
with that identifier. The advantage of annotating generic 
stereoisomers with identifiers for more specific ones is 
precisely that they provide links to such data. The dis- 
advantage is that the identity of metabolic components 
becomes somewhat ambiguous. It may, for example, not 
be obvious to all users of Recon 2 whether the metabolic 
component lactose represents the generic stereoisomer or 
only α-lactose since it is annotated with identifiers for 
both. 

Reconstruction of the metabolic network of an organ- 
ism or cell type is an iterative process. Recon 2 is the 
latest iteration of the human metabolic network recon- 
struction. While Recon 2 is much more comprehen- 
sive than its predecessor Recon 1 [34], it probably does 
not capture the entire human metabolic network and 
further iterations are expected in the future [23]. Our 
results suggest some guidelines for researchers to keep in 
mind when annotating new metabolites that are added to 
metabolic reconstructions, human or otherwise. Firstly, 
each metabolite should be annotated with at least one 
identifier besides name if that is possible. Secondly, each 
metabolite should generally be annotated with identifiers 
for its neutral form. Exceptions exist for metabolites that 
only participate in metabolic reactions in a particular 
charged state, e.g., inorganic ions such as Cl⁻ and Mg²⁺. 
Thirdly, each metabolite should preferably be annotated 
with the most generic identifier of each type that is still 
valid. This also applies to metabolite names. The name D- 
glucose, for example, should not be used for a metabolic 
component that is meant to represent α-D-glucose or 
vice versa. If no generic identifier of a particular type 
is available for a metabolite, a more specific identifier 
may be used. Researchers should, however, be aware that 
doing so makes the identity of that metabolite somewhat 
ambiguous. Best practice would be to include a note that 
specifies the relationship of an identifier to a metabolic 
component. The ChEBI ontology could serve as a guide- 
line for how relationships between metabolites should be 
specified. So, for example, it would be noted that the 
metabolite with HMDB ID HMDB00186 (Figure 1b) is a 
Lactose. 

An InChI based metabolite annotation application 
has the potential to enable fully automatic mapping 
between identifiers for metabolites with defined struc- 
tures. Fully automatic mapping, however, would require 
that both databases and reconstructions were free of 
errors and ambiguity. As several authors have demon- 
strated [28,33,35], errors and ambiguities are quite com- 
mon in publicly available chemical databases. If a database 
identifier is associated with an incorrect InChI in a 
database it will not be mapped to the correct metabolite 
in a reconstruction. As we demonstrated here, erroneous 
metabolite annotations are also found in metabolic 
reconstructions. In particular, we removed 124 incorrect 
identifiers from Recon 2. As discussed above, metabolite 
annotations in metabolic reconstructions can also be 
ambiguous, e.g., when a metabolite is annotated with the 
name of one stereoisomer but the KEGG CID of another. 
If a metabolite is annotated with an incorrect identifier 
in a metabolic reconstruction it may not be mapped to 
the correct database or structure based identifier. Fully 
automatic mapping between metabolite identifiers will 
not be possible until such errors and ambiguities are 
resolved. An InChI based application such as CTS, how- 
ever, with the modifications suggested above, can signif- 
icantly reduce the manual effort required for mapping 
between metabolite identifiers. 

Experimental 

CTS was accessed by calling the "Convert" web service 
as described at http://cts.fiehnlab.ucdavis.edu/moreServices/index. 
UniChem was accessed by calling 
the web service method "Get src_compound_ids from 
src_compound_id" as described at 
https://www.ebi.ac.uk/unichem/info/webservices. Web services were called 
from MATLAB (version R2009b, MathWorks, Natick, 
MA) using the built-in function urlread. MetMask (version 
0.5.3) was installed on a desktop computer running 
Windows 7. MetMask databases were created by importing 
identifier groups from Recon 2, HMDB and ChEBI. 
Database queries were formulated as described at 
http://metmask.sourceforge.net/manual.html with all options 
set to their default values. Output from all mapping 
applications was parsed in MATLAB using the built-in 
function regexp. 
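
For readers who do not use MATLAB, the same fetch-and-parse pattern can be sketched in Python, with `urllib.request.urlopen` standing in for urlread and `re.findall` for regexp. This is an illustrative analogue only: the helper names are hypothetical, no web service URL or response format is implied, and the sample text below is made up for the example.

```python
import re
from urllib.request import urlopen


def fetch_text(url, timeout=30):
    """Return the body of a web service response as text (cf. urlread)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_identifiers(response_text, pattern):
    """Return all substrings matching an identifier pattern (cf. regexp)."""
    return re.findall(pattern, response_text)


# Made-up response text; real services return their own formats.
sample = "query: lactose; result: HMDB00186"
# Assumed pattern for HMDB IDs written as 'HMDB' followed by five digits.
hmdb_ids = extract_identifiers(sample, r"HMDB\d{5}")
```

In practice the text passed to `extract_identifiers` would come from `fetch_text`, called with a service URL built according to the documentation of the application being queried.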

Conclusions 

We found that an application implementing an InChI 
based strategy could facilitate automatic mapping 
between metabolite identifiers in metabolic network 
reconstructions. Of the two InChI based applications 
evaluated here we found CTS to be qualitatively better. 
The main advantage of CTS is the large number of iden- 
tifier types it can map between. As CTS is open source it 
can be adapted to the task of mapping between metabolite 
identifiers in metabolic network reconstructions with rel- 
ative ease. We suggest several features that could be added 
to CTS to optimize its performance on this task. In partic- 
ular, we suggest combining multiple types of information 
about metabolites to find new identifiers. A confidence 
score can be used to account for the fact that some types 
of input information, in particular metabolite names, are 
less reliable than others. We further suggest implement- 
ing various quality checks, similar to those implemented 
in UniChem, to limit the number of incorrect identifiers 
returned for a metabolite. When simple versions of some 
of the suggested features were implemented, CTS allowed 
us to update more than 3,500 metabolite identifiers in 
Recon 2. Most were updated automatically. Based on 
this experience, we suggest some guidelines for future 
annotation of metabolites in metabolic network recon- 
structions. We hope that the updated Recon 2 identifiers 
will facilitate application of Recon 2 in the future. More 
generally, we hope that our results will guide developers 
of reconstruction tools in implementing strategies for 
automatically updating metabolite identifiers in metabolic 
network reconstructions. 

Methods 

Design of identifier mapping tests 

MetMask and CTS were tested by mapping six types of 
input identifiers to four types of output identifiers. The 
six input identifier types were metabolite name, standard 
InChIKey, ChEBI ID, HMDB ID, KEGG CID and 
PubChem CID. Output identifier types were the subset of 
input types that act as primary keys in public databases, 
i.e., ChEBI ID, HMDB ID, KEGG CID and PubChem CID. 
Output identifiers of these four types can easily be verified 
by looking them up in the relevant databases. Since 
metabolite names and InChIKeys are more difficult to verify, 
they were not included as output identifier types. In 
total, we tested MetMask and CTS on 20 pairs of input-output 
identifier types, since input identifiers were not 
mapped to output identifiers of the same type. ChEBI ID, 
for example, were only mapped to HMDB ID, KEGG CID 
and PubChem CID. UniChem was tested on nine input-output 
pairs, since it does not cover metabolite names or 
PubChem CID. All three applications were tested on a 
set of 100 metabolites from the human metabolic reconstruction 
Recon 2 [23]. The metabolites were chosen randomly 
from the subset of Recon 2 metabolites that were 
already annotated with at least two of the four database 
identifiers. For each metabolite, we verified the existing 
annotations and attempted to fill in missing annotations 
of the remaining four input types. The end result was 
100 metabolite names, 100 InChIKeys, 98 ChEBI ID, 100 
HMDB ID, 97 KEGG CID and 100 PubChem CID. A minimum 
of five identifiers were located for each of the 100 
test metabolites. 

Figure 4 Annotation of output identifiers. An example 
demonstrating annotation of output PubChem Compound identifiers 
(b-e), when the KEGG Compound identifier for D-glucose (a) is input 
to a mapping application. The preferred output identifier is for 
D-glucose (b), but an identifier for alpha-D-glucose (c) is also valid 
since it is a D-glucose. An identifier for a generic hexose (d), however, 
is not valid. Finally, an identifier for phospholactic acid (e), which is a 
completely different compound, is incorrect. 
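
The numbers of input-output pairs quoted in this section (20 for MetMask and CTS, nine for UniChem) follow from excluding same-type mappings, as the short Python enumeration below shows. The variable and function names are illustrative assumptions.

```python
INPUT_TYPES = ["name", "InChIKey", "ChEBI ID", "HMDB ID", "KEGG CID", "PubChem CID"]
OUTPUT_TYPES = ["ChEBI ID", "HMDB ID", "KEGG CID", "PubChem CID"]


def mapping_pairs(input_types, output_types):
    """List all (input, output) type pairs, skipping same-type mappings."""
    return [(i, o) for i in input_types for o in output_types if i != o]


# MetMask and CTS: six input types, four output types.
all_pairs = mapping_pairs(INPUT_TYPES, OUTPUT_TYPES)

# UniChem: metabolite names and PubChem CID are not covered.
not_in_unichem = {"name", "PubChem CID"}
unichem_pairs = mapping_pairs(
    [t for t in INPUT_TYPES if t not in not_in_unichem],
    [t for t in OUTPUT_TYPES if t not in not_in_unichem],
)
```

Six input types times four output types gives 24 combinations, of which four are same-type and are dropped, leaving 20. For UniChem, four input types times three output types gives 12, minus three same-type pairs, leaving nine.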

Scoring 

To quantify the relative performance of the three map- 
ping applications, we devised a simple scoring system. 
For each input-output pair, each mapping application 
returned a list of candidate output identifiers associ- 
ated with the set of input identifiers. The number of 
output identifiers associated with a single input iden- 
tifier ranged from zero to several. We annotated each 
returned output identifier as preferred, valid, invalid, 
incorrect or nonexistent (see Figure 4 for an example). 
Preferred identifiers point to the preferred stereoisomer 
of each compound, which is generally the same as the 
input stereoisomer. There is exactly one preferred output 
identifier for each input identifier. Valid identifiers point 
to valid but not preferred stereoisomers, invalid identi- 
fiers point to invalid stereoisomers or mixtures, incorrect 
identifiers point to different compounds, and nonexis- 
tent identifiers do not point to anything. Once all output 
identifiers had been annotated in this manner, we calcu- 
lated a score based on the number of input identifiers 
(In), the number of input identifiers for which at least 
one output identifier was returned (Hits), the total num- 
ber of returned output identifiers (Out), and the number 
of preferred output identifiers (Matches). The score was 
calculated as 



This score can range from 0 to 1. An application receives 
the maximum score if it returns the preferred output iden- 
tifier, and no other, for each input identifier. It receives 
a lower score if it returns non-preferred identifiers, or 
none at all, for a subset of input identifiers, since some 
manual effort is then required to sort through results and 
fill gaps. Note that the number of input identifiers (In) 
varies between identifier types, because two of the 100 test 
metabolites were not found in ChEBI (In = 98) and three 
were not found in KEGG Compound (In = 97). 
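
As an illustration only, one form that is consistent with the properties described above is the product of coverage (Hits/In) and specificity (Matches/Out). This specific form is an assumption made for the sketch and is not taken from the text; the function name `mapping_score` is likewise hypothetical.

```python
def mapping_score(n_in, hits, out, matches):
    """Assumed score: coverage (hits / n_in) times specificity (matches / out).

    n_in    -- number of input identifiers (In)
    hits    -- inputs with at least one returned output identifier (Hits)
    out     -- total number of returned output identifiers (Out)
    matches -- number of preferred output identifiers returned (Matches)
    """
    if n_in == 0 or out == 0:
        return 0.0  # nothing was input or nothing was returned
    coverage = hits / n_in
    specificity = matches / out
    return coverage * specificity


# Exactly one preferred identifier returned for each of 98 inputs.
perfect = mapping_score(98, 98, 98, 98)
# Half of the inputs hit, and half of the returned identifiers are preferred.
partial = mapping_score(100, 50, 100, 50)
```

Under this assumed form the score is 1 only when every input returns its preferred identifier and nothing else, and it falls when inputs return nothing (lower coverage) or return extra non-preferred identifiers (lower specificity).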